pacman::p_load(tmap, sf, sfdep, tidyverse, DT, leaflet, stplanr, sp, reshape2, wdpar, patchwork)
# devtools::install_github("thomasp85/patchwork")
# library(patchwork)Take-home Exercise 2
0. Goals & Objectives.
Our goal is to conduct a case study to demonstrate the potential value of Geospatial Data Science & Analysis, by integrating publicly available data from multiple sources to build a spatial interaction models, to determine factors affecting urban mobility patterns of public bus transit.
Recentdeployments of massively pervasive computing technologies such as in-vehicle GPS and SMART cards by public transport provide plenty of data for tracking movement patterns through space and time, but the explosive growth of data has outstripped public services’ ability to utilise, transform, and understand the data.
More detail about the task from: https://isss624-ay2023-24nov.netlify.app/take-home_ex02
Modifible Areal Unit Problem (MAUP): - MPSZ is too coarse, too huge a subzone area; people may live only in a corner of the subzone - Planning subzones too large, so we use analytical hexagon
“Map is not interesting, pattern revealed by and factors affecting the map is interesting. Can we explain this by building a spatial model?” - Building a model to explain flows; Spatial Interaction Model
GLM: Generalised Linear-Regression Model (over linear model):
- Dwelling Units as proxy for population;
- Can compare HDB-only VS dwelling units vs room-flat vs 1/2/3/ room flat unit
- POI: points of interest name & type
Due to the nature of EDA and Data Analysis, parts of this page have been Collapsed or placed behind tabs, to avoid excessive scrolling.
For easier reading, this study is also presented in point-form.
0.1 Dataset used
0.1.1 Aspatial Dataset
| Dataset, Purpose & Source: | Key columns |
|---|---|
Via data.gov |
|
Via LTA DataMall |
|
0.1.2 Geospatial Dataset
| Filename, Purpose & Source: | Key columns |
|---|---|
Via data.gov |
|
Via LTA DataMall |
|
1. Geospatial Data Wrangling
This study was performed in R, written in R Studio, and published using Quarto.
1.1 Import Packages
This function calls pacman to load sf, tidyverse, tmap, knitr packages;
tmap: For thematic mapping; powerful mapping package;sf: for geospatial data handling, but also geoprocessing: buffer, point-in-polygon count, etc;sp: forspDists()at high speedreshape2: formelt()instead ofpivot_longer()sfdep: useful functions for creating weight matrix, LISA calculations etc;tidyverse: for non-spatial data handling; commonly used R package and containsdplyrfor dataframe manipulation andggplotfor data visualization;DT: for displaying datatables;leaflet: for custom layer controls overtmapvisualisations.stplanar: for creating desire lines to visualize O/D flows
1.2 Import Geospatial Data
1.2.1 raw_bus_stop_sf: Load Geospatial Bus Stop Data
- First, we load
BusStopshapefile data from LTA Datamall; st_read()is used to import the ESRI Shapefile data into ansfdataframe.- From previous take-home exercise, we know BusStop has the incorrect CRS (coordinate reference system), as EPSG 9001 instead of 3414, so we use
st_set_crs()to correct this - We use
head()to preview the first 6 rows
- From previous take-home exercise, we know BusStop has the incorrect CRS (coordinate reference system), as EPSG 9001 instead of 3414, so we use
show code
raw_bus_stop_sf <- st_read(dsn = "data/geospatial",
layer = "BusStop") %>%
st_set_crs(3414)Reading layer `BusStop' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
that
show code
head(raw_bus_stop_sf)Simple feature collection with 6 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 13228.59 ymin: 30391.85 xmax: 41603.76 ymax: 44206.38
Projected CRS: SVY21 / Singapore TM
BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
1 22069 B06 OPP CEVA LOGISTICS POINT (13576.31 32883.65)
2 32071 B23 AFT TRACK 13 POINT (13228.59 44206.38)
3 44331 B01 BLK 239 POINT (21045.1 40242.08)
4 96081 B05 GRACE INDEPENDENT CH POINT (41603.76 35413.11)
5 11561 B05 BLK 166 POINT (24568.74 30391.85)
6 66191 B03 AFT CORFE PL POINT (30951.58 38079.61)
- We check the coordinate reference system with
st_crs(); and see that it is now indeed correctly set to 3414:
st_crsReadout
show code
st_crs(raw_bus_stop_sf)Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
- We use
qtm()to perform a quick visualization of the various bus stops:
show code
tmap_mode("plot")tmap mode set to plotting
show code
qtm(raw_bus_stop_sf)
1.2.2 mpsz_sf: Visualizing Singapore’s Master Plan 2019 Subzone Boundaries
We now load Master Plan 2019 Subzone Boundary
Next, we load
MPSZ-2019shapefile data from Data.gov.sg, comprising Master Plan 2019 Subzone Boundary (No Sea) data ;st_read()is used to import the ESRI Shapefile data into ansfdataframe;- We use
st_crs()to check the Coordinate Reference System (CRS);
- We use
show code
mpsz_sf <- st_read(dsn = "data/geospatial",
layer = "MPSZ-2019") Reading layer `MPSZ-2019' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS: WGS 84
show code
st_crs(mpsz_sf)Coordinate Reference System:
User input: WGS 84
wkt:
GEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["latitude",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["longitude",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]]
- We see that
MPSZ-2019is set to WGS 84/EPSG 4326 projection, which is inconsistent with the rest of our dataset; - Thus, we use
st_transform()to reproject the data to SVY21/EPSG 3414 Coordinate Reference System (CRS) to ensure consistent distance calculation:
show code
mpsz_sf <- mpsz_sf %>%
st_transform(crs=3414)
st_crs(mpsz_sf)Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
- Now, let’s visualize the bus stop within the
mpszsubzone boundaries; - We set
tmap_mode("plot")to allow us to scroll; - We use
tm_shape() + tm_polygons()to map a base layer of thempszboundaries;- On top of which, we layer
tm_shape() + tm_dots()to show the bus stops.
- On top of which, we layer
show code
tmap_mode("plot")tmap mode set to plotting
show code
tmap_options(check.and.fix = TRUE)
tm_shape(mpsz_sf) +
tm_polygons() +
tm_shape(raw_bus_stop_sf) +
tm_dots()Warning: The shape mpsz_sf is invalid. See sf::st_is_valid

- We note there are a number of bus stops outside Singapore’s boundaries; Specifically, three bus stops in a cluster just outside the Causeway, and one further North.
- We perform several steps to isolate & check the data;
- we use
st_filter()to find bus stops within Singapore’s Administrative National Boundaries, and createsg_bus_stop_sffor future use. - to check what bus stops have been dropped, we use
anti_join()to compareraw_bus_stop_sfwithsg_bus_stop_sf. We usest_drop_geometryon bothsfdataframes to only compare the non-geometry columns.
- we use
show code
sg_bus_stop_sf <- st_filter(raw_bus_stop_sf, mpsz_sf)
anti_join(st_drop_geometry(raw_bus_stop_sf), st_drop_geometry(sg_bus_stop_sf))Joining with `by = join_by(BUS_STOP_N, BUS_ROOF_N, LOC_DESC)`
BUS_STOP_N BUS_ROOF_N LOC_DESC
1 47701 NIL JB SENTRAL
2 46239 NA LARKIN TER
3 46609 NA KOTARAYA II TER
4 46211 NIL JOHOR BAHRU CHECKPT
5 46219 NIL JOHOR BAHRU CHECKPT
- We see there are in fact 5 bus stops outside of Singapore (including the defunct Kotaraya II Terminal) that have been removed, which means our data cleaning was correct.
1.3 Geospatial Data Cleaning
1.3.1 Removing Duplicate Bus Stops
- From Take-home Exercise 1, we know that there are a number of repeated bus stops. We repeat some steps;
- We use
length()to find the total number of raw values in the$BUS_STOP_Ncolumn ofsg_bus_stop_sf;- We then compare this to
length(unique())to find the unique values;
- We then compare this to
- And, indeed, we find there are 16 bus stops that are repeated;
cat("\nResults before removing duplicates: \n=======================================================\n")
Results before removing duplicates:
=======================================================
cat("Total number of rows in sg_bus_stop_sf\t\t: ", paste0(length(sg_bus_stop_sf$BUS_STOP_N)))Total number of rows in sg_bus_stop_sf : 5156
cat("\nTotal unique bus stop IDs in sg_bus_stop_sf\t: ", paste0(length(unique(sg_bus_stop_sf$BUS_STOP_N))))
Total unique bus stop IDs in sg_bus_stop_sf : 5140
cat("\nRepeated bus stops\t\t\t\t: ", paste0(length(sg_bus_stop_sf$BUS_STOP_N) - length(unique(sg_bus_stop_sf$BUS_STOP_N))))
Repeated bus stops : 16
- It appears there are 16 datapoints that are specifically repeated; let’s remove them by deleting the duplicated rows:
- we use
duplicated()to identify rows with repeated values of$BUS_STOP_N; duplicated rows will returnTRUEwhile all other rows will returnFALSE - We use
!to invert the values, so only the unduplicated rows will returnTRUE. - We then use square brackets
[]to indexsg_bus_stop_sfbased on the rows, and return only the unduplicated rows; - We then assign the output using
<-intobus_stop_sf, for use.
- we use
show code
bus_stop_sf <- sg_bus_stop_sf[!duplicated(sg_bus_stop_sf$BUS_STOP_N), ]
head(bus_stop_sf)Simple feature collection with 6 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 13228.59 ymin: 30391.85 xmax: 41603.76 ymax: 44206.38
Projected CRS: SVY21 / Singapore TM
BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
1 22069 B06 OPP CEVA LOGISTICS POINT (13576.31 32883.65)
2 32071 B23 AFT TRACK 13 POINT (13228.59 44206.38)
3 44331 B01 BLK 239 POINT (21045.1 40242.08)
4 96081 B05 GRACE INDEPENDENT CH POINT (41603.76 35413.11)
5 11561 B05 BLK 166 POINT (24568.74 30391.85)
6 66191 B03 AFT CORFE PL POINT (30951.58 38079.61)
show code
cat("\nResults after removing duplicates: \n=======================================================\n")
Results after removing duplicates:
=======================================================
show code
cat("Total number of rows in bus_stop_sf\t\t: ", paste0(length(bus_stop_sf$BUS_STOP_N)))Total number of rows in bus_stop_sf : 5140
show code
cat("\nTotal unique bus stop IDs in bus_stop_sf\t: ", paste0(length(unique(bus_stop_sf$BUS_STOP_N))))
Total unique bus stop IDs in bus_stop_sf : 5140
show code
cat("\nRepeated bus stops\t\t\t\t: ", paste0(length(bus_stop_sf$BUS_STOP_N) - length(unique(bus_stop_sf$BUS_STOP_N))))
Repeated bus stops : 0
- We can do a quick check to visualize these;
- We use
tmap_mode("view")to allow us to scroll around and check that the bus stops fall within Singapore’s national boundaries, and set zoom limits to focus the attention - We combine
tm_shape()andtm_polygons()to map the master plan subzones in grey; - We combine
tm_shape()andtm_dots()to map locations of bus stops; for visual distinction with the grey zones, we use the “Spectral” palette
- We use
show code
tmap_mode("view")tmap mode set to interactive viewing
show code
tm_shape(mpsz_sf) +
tm_polygons() +
tm_shape(bus_stop_sf) +
tm_dots(col = "BUS_STOP_N", palette = "Spectral", legend.show = FALSE) +
tm_view(set.zoom.limits = c(11, 13))Warning: The shape mpsz_sf is invalid (after reprojection). See sf::st_is_valid
Warning: Number of levels of the variable "BUS_STOP_N" is 5140, which is larger
than max.categories (which is 30), so levels are combined. Set
tmap_options(max.categories = 5140) in the layer function to show all levels.
show code
tmap_mode("plot")tmap mode set to plotting
- Now, we can start preparing the hexagon map.
1.4 Generating Hexagon Maps
- We generate the hexagon map in three steps:
- We use
st_make_grid()withsquare = FALSEto create the hexagon layer as defined in the study, which we nameraw_hex_grid;- We pass
cellsize = 750to create the hexagons of necessary size. Prof Kam defined the apothem as 375m, as the Traffic Analysis Zone is typically 750m in size. - I used
units::as_unitsto pass 750 metres into the argument. I am still uncertain whether a length of 750m needs to be reprojected, or whether we need to do any further transformation. - We use
st_transform()just in case we need to reproject the coordinate system, just in case.
- We pass
- We use
st_sf()to convertraw_hex_gridinto ansfdataframe for further manipulation later; - However, trying to visualize this right now just gives us a map full of hexagons, so need to eliminate the empty hexagons; - We usemutate()to create a new column,$n_bus_stopsthat counts the number of bus stops in each hexagon usinglengths(st_intersects())st_intersects()gives us a list of bus stops in each hexagon, so we uselengths()to count the number
- We create our final
sfdataframe,hexagon_sfin two steps; - First, we usefilter()to select only hexagons with nonzero number of bus stops; - Then,mutate()is used here to create agrid_idcolumn, labelling only the hexagons with nonzero bus stops.
- Finally, we perform a quick plot to confirm that every bus stop is inside a hexagon;
- Using
hexagon_sfas our base layer, we usetmap_shape()andtm_polygons()to visualizes our hexes. We passpalette = "Spectral"for visual distinguishment against the black dots of the bus stops. - We use
tmap_shape()andtm_dots()to visualize our bus stop as black dots
- Using
show code
# STEP 1 - Create hexagon map
raw_hex_grid <- st_make_grid(bus_stop_sf, cellsize = units::as_units(375, "m"), what = "polygons", square = FALSE) %>%
st_transform(crs = 3414)
# STEP 2 - Convert to sf object and count the number of bus stops inside;
raw_hex_grid <- st_sf(raw_hex_grid) %>%
mutate(n_bus_stops = lengths(st_intersects(raw_hex_grid, sg_bus_stop_sf)))
# Count number of points in each grid, code snippet referenced from:
# https://gis.stackexchange.com/questions/323698/counting-points-in-polygons-with-sf-package-of-r
# STEP 3 - Filter for nonempty hexes and label:
hexagon_sf <- filter(raw_hex_grid, n_bus_stops > 0)
hexagon_sf <- hexagon_sf %>%
mutate(grid_id = 1:nrow(hexagon_sf)) %>%
select(grid_id, raw_hex_grid, n_bus_stops)
tmap_mode("plot")tmap mode set to plotting
show code
tm_shape(hexagon_sf) +
tm_polygons(col = "grid_id", palette = "Spectral", legend.show = FALSE) +
tm_shape(bus_stop_sf) +
tm_dots()
show code
tmap_mode("plot")tmap mode set to plotting
- We can visually confirm that every black dot is within a hexagon.
- We re-use a bit of code from Take-Home Exercise 1 to plot our analytical hexagons on a map of Singapore and visualize the number of bus stops in each hexagon:
show code
tmap_mode("view")tmap mode set to interactive viewing
show code
tm_basemap(providers$OneMapSG.Grey) +
tm_shape(hexagon_sf) +
tm_fill(
col = "n_bus_stops",
palette = "-plasma",
style = "cont",
breaks = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
title = "Number of bus_stops",
id = "grid_id",
showNA = FALSE,
alpha = 0.6,
popup.vars = c(
"Number of Bus Stops: " = "n_bus_stops"
),
popup.format = list(
n_bus_stops = list(format = "f", digits = 0)
)
) +
tm_borders(col = "grey40", lwd = 0.7) +
tm_view(set.zoom.limits = c(11, 13))show code
tmap_mode("plot")tmap mode set to plotting
- We perform some simple stats to count the total number of filtered hexagons, and to see the maximum number of bus stops in a hexagon.
- Unlike Take-Home Exercise 1, the number of hexagons have decreased, and the maximum number of bus stops per hexagon is lower.
show code
# NB: Code reused from my own take-home exercise 1
cat(paste("Total number of raw hexagons is\t\t\t: ", nrow(raw_hex_grid), "\n"))Total number of raw hexagons is : 8843
show code
cat(paste("Total number of hexagons (n_bus_stop > 1) is\t: ", nrow(hexagon_sf)), "\n")Total number of hexagons (n_bus_stop > 1) is : 2172
show code
cat("\nPrinting map_hexagon_sf:\n >> ")
Printing map_hexagon_sf:
>>
show code
hexagon_sf[hexagon_sf$n_bus_stops > 5, ]Simple feature collection with 47 features and 2 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 10532.62 ymin: 29837.95 xmax: 42782.62 ymax: 47807.98
Projected CRS: SVY21 / Singapore TM
First 10 features:
grid_id n_bus_stops raw_hex_grid
149 149 6 POLYGON ((10720.12 34059.82...
170 170 6 POLYGON ((11470.12 34059.82...
191 191 6 POLYGON ((12032.62 35683.62...
214 214 6 POLYGON ((12595.12 36008.38...
217 217 6 POLYGON ((12595.12 39255.98...
297 297 6 POLYGON ((14282.62 35034.1,...
347 347 7 POLYGON ((15782.62 34384.58...
405 405 6 POLYGON ((17470.12 39905.49...
426 426 6 POLYGON ((17845.12 39905.49...
441 441 6 POLYGON ((18032.62 40230.25...
- For the next step, we’ll need to manage the aspatial bus trips dataset, which is what we’ll work on now.
1.5 Aspatial Data Wrangling: Bus trip dataset
1.5.1 Import Bus O/D Dataset
For our purposes, we will focus only on 2023-October Passenger Volume by Origin Destination Bus Stops, downloaded from LTA DataMall;
We use
read_csv()to load the data from the .csv file;We use
select()with a-sign to remove two columns redundant for our analysis:$PT_TYPEcolumn indicates the type of public transport, however, every value is “BUS”$YEAR_MONTHcolumn similarly has “2023-10” for every value, which we are aware of- With this in mind, we drop these two columns to save space.
Finally, we use
mutate_at()to convert two columns ($ORIGIN_PT_CODEand$DESTINATION_PT_CODE)from character to factor, for easier analysis.We use
str()to check the columns, datatypes, and number of rows:
show code
odbus_2310 <- read_csv("data/aspatial/origin_destination_bus_202310.csv") %>%
select( -PT_TYPE, -YEAR_MONTH) %>%
mutate_at(c("ORIGIN_PT_CODE", "DESTINATION_PT_CODE"), as.factor)Rows: 5694297 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): YEAR_MONTH, DAY_TYPE, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE
dbl (2): TIME_PER_HOUR, TOTAL_TRIPS
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
show code
str(odbus_2310)tibble [5,694,297 × 5] (S3: tbl_df/tbl/data.frame)
$ DAY_TYPE : chr [1:5694297] "WEEKENDS/HOLIDAY" "WEEKDAY" "WEEKENDS/HOLIDAY" "WEEKDAY" ...
$ TIME_PER_HOUR : num [1:5694297] 16 16 14 14 17 17 17 7 14 14 ...
$ ORIGIN_PT_CODE : Factor w/ 5073 levels "01012","01013",..: 105 105 4428 4428 2011 834 834 781 4462 4462 ...
$ DESTINATION_PT_CODE: Factor w/ 5077 levels "01012","01013",..: 240 240 4742 4742 693 809 809 235 4002 4002 ...
$ TOTAL_TRIPS : num [1:5694297] 3 5 3 5 4 1 24 2 1 7 ...
- This is a huge
tibbledataframe with over 5 million rows, so we will filter this now by peaks; - For this study, we focus on Weekday afternoon peak
1.5.2 Filter For Peaks – Weekday Afternoon
- We now perform a multi-step filter process;
- We combine
mutate()withcase_when()to create a new column,$PEAK, based on the study criteria:- We set the value to “WEEKDAY_AFTERNOON_PEAK” if
$DAY_TYPEis “WEEKDAY” and bus tap-on time (e.g.$TIME_PER_HOUR) is between 5 pm and 8pm, inclusive;- Note that we convert the values for
$TIME_PER_HOURto 24-hour clock, e.g. “5pm” is “17” hundred hours, “8pm” is “20” hundred hours.
- Note that we convert the values for
- For all remaining values, we assign an “Unknown” value.
- We set the value to “WEEKDAY_AFTERNOON_PEAK” if
- We then use
filter()to eliminate those with “Unknown”$PEAKvalues, i.e. rows outside the period of interest for the study - We use
group_by()to group the values by$ORIGIN_PT_CODEand$DESTINATION_PT_CODE, and usesummarise()to aggregate the sum of$TOTAL_TRIPSas a new column,$TRIPS. - We use
write_rds()to save the output dataframe,odbus_filtered, as an RDS object for future reference/load.
- We combine
odbus_filtered <- odbus_2310 %>%
mutate(PEAK = case_when(
DAY_TYPE == "WEEKDAY" & TIME_PER_HOUR >= 17 & TIME_PER_HOUR <= 20 ~ "WEEKDAY_AFTERNOON_PEAK",
TRUE ~ "Unknown"
)) %>%
filter(
case_when(
PEAK == "Unknown" ~ FALSE,
TRUE ~ TRUE
)) %>%
group_by(ORIGIN_PT_CODE, DESTINATION_PT_CODE) %>%
summarise(TRIPS = sum(TOTAL_TRIPS))`summarise()` has grouped output by 'ORIGIN_PT_CODE'. You can override using
the `.groups` argument.
write_rds(odbus_filtered, "data/rds/odbus_filtered_weekday.rds")
head(odbus_filtered)# A tibble: 6 × 3
# Groups: ORIGIN_PT_CODE [1]
ORIGIN_PT_CODE DESTINATION_PT_CODE TRIPS
<fct> <fct> <dbl>
1 01012 01112 540
2 01012 01113 516
3 01012 01121 303
4 01012 01211 363
5 01012 01311 630
6 01012 01549 6
1.6 Combine Bus Trip Data With hexagon_sf Dataframe
- For our study purposes, we need to have the number of bus trips originating from each hexagon. In order to achieve this, we must combine our three current dataframes:
hexagon_sf, ansfdataframe with$grid_idcolumn (primary key) and the specific polygon geometry of each hexagon;bus_stop_sf, ansfdataframe with the$BUS_STOP_N(primary key) and the point geometry of each bus stop;odbus_filtered, atibbledataframe with the$ORIGIN_PT_CODE(primary key) column and the trip details for each of the four peak periods of interest.
1.6.1 bus_stop_hexgrid_id: Identify Hexagon grid_id For Each Bus Stop
- First, we combine
hexagon_sfwithsg_bus_stop_sf;- We use
st_intersectionto combine the dataframes such that each row ofsg_bus_stop_sfhas an associatedgrid_id; - We use
select()to filter the resultantbus_stop_hexgrid_iddataframe to only$grid_idand$BUS_STOP_Ncolumns, and usest_drop_geometry()to convert into a simple dataframe with just two columns:
- We use
show code
bus_stop_hexgrid_id <- st_intersection(bus_stop_sf, hexagon_sf) %>%
select(BUS_STOP_N, grid_id) %>%
st_drop_geometry()Warning: attribute variables are assumed to be spatially constant throughout
all geometries
show code
cat("\nNumber of bus stops\t:", length(unique(bus_stop_hexgrid_id$BUS_STOP_N)))
Number of bus stops : 5140
show code
cat("\nNumber of hexgrids\t:", length(unique(bus_stop_hexgrid_id$grid_id)))
Number of hexgrids : 2170
show code
head(bus_stop_hexgrid_id) BUS_STOP_N grid_id
3265 25059 1
3265.1 25059 2
2566 25751 3
254 26379 4
2399 26369 5
2893 25761 6
- Eagle-eyed readers may notice that BUS_STOP_N #25059 is in both
$grid_id1 & 2; - Let’s check if there’s any other duplicates:
show code
bus_stop_hexgrid_id$BUS_STOP_N[duplicated(bus_stop_hexgrid_id$BUS_STOP_N)][1] "25059"
- Only BUS_STOP_N #25059 is duplicated; let’s see why this is:
show code
tmap_mode("view")tmap mode set to interactive viewing
show code
tm_shape(hexagon_sf[hexagon_sf$grid_id %in% c(1, 2),]) +
tm_polygons(col = "grid_id", palette = "Spectral") +
tm_shape(bus_stop_sf[bus_stop_sf$BUS_STOP_N %in% c(25059),]) +
tm_dots()show code
tmap_mode("plot")tmap mode set to plotting
- It appears that BUS_STOP_N #25059 is exactly on the boundary between the two hexagons.
- Here we perform simple manual intervention data cleaning, declaring the bus stop is in
$grid_id2 by deleting the first row;- We use
subset()to select values of$grid_idthat are not equal to 1
- We use
show code
bus_stop_hexgrid_id <- subset(bus_stop_hexgrid_id, grid_id != 1)
cat("\nNumber of bus stops\t:", length(unique(bus_stop_hexgrid_id$BUS_STOP_N)))
Number of bus stops : 5140
show code
cat("\nNumber of hexgrids\t:", length(unique(bus_stop_hexgrid_id$grid_id)))
Number of hexgrids : 2169
show code
cat("\n\nPrinting rows with duplicated bus stop values\t:", bus_stop_hexgrid_id$BUS_STOP_N[duplicated(bus_stop_hexgrid_id$BUS_STOP_N)])
Printing rows with duplicated bus stop values :
show code
cat("\n\n(If empty, it worked!)")
(If empty, it worked!)
- Now we append hexagon code onto
odbus_filtered. We do this in two steps; - First, we use
left_join()to add the hexagon$grid_idbyBUS_STOP_Nnumber;- We use
rename()to rename columns, and specify the origin bus stop and origin hex grid id asORIGIN_BSandORIGIN_HEXrespectively; - We create a
duplicatetibble data.frame to check for duplicate results, and luckily we see an empty tibble
- We use
show code
od_data <- left_join(odbus_filtered , bus_stop_hexgrid_id,
by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
rename(ORIGIN_BS = ORIGIN_PT_CODE,
ORIGIN_HEX = grid_id)
duplicate <- od_data %>%
group_by_all() %>%
filter(n()>1) %>%
ungroup()
duplicate# A tibble: 0 × 4
# ℹ 4 variables: ORIGIN_BS <chr>, DESTINATION_PT_CODE <fct>, TRIPS <dbl>,
# ORIGIN_HEX <int>
show code
# od_data[!complete.cases(od_data), ]
#
# bus_stop_hexgrid_id[]
# hexagon_sf[hexagon_sf$grid_id %in% c(1, 2),]
# bus_stop_hexgrid_id[bus_stop_hexgrid_id$BUS_STOP_N %in% c(03361),]
#
# raw_bus_stop_sf[raw_bus_stop_sf$BUS_STOP_N %in% c(03361, 03549, 03579, 59009),]- Now we repeat the step, appending the destination hex ID
show code
od_data <- left_join(od_data , bus_stop_hexgrid_id,
by = c("DESTINATION_PT_CODE" = "BUS_STOP_N")) %>%
rename(DESTIN_BS = DESTINATION_PT_CODE,
DESTIN_HEX = grid_id)
duplicate <- od_data %>%
group_by_all() %>%
filter(n()>1) %>%
ungroup()
duplicate# A tibble: 0 × 5
# ℹ 5 variables: ORIGIN_BS <chr>, DESTIN_BS <chr>, TRIPS <dbl>,
# ORIGIN_HEX <int>, DESTIN_HEX <int>
show code
# od_data[!complete.cases(od_data), ]- Once again, we have an empty tibble.
- We finally create the final version of our flow dataset.
- Just to be safe, we perform a step to remove duplicates using
unique() drop_na()removes bus stops for which we have no info. It turns out there are bus stop numbers outside our bus stop dataset, like 03361, 03549, 03579, 59009;- We use
group_by()to combine rows with the same$ORIGIN_HEXand$DESTIN_HEX, and aggregate the number of trips withsummarise()
- Just to be safe, we perform a step to remove duplicates using
show code
od_data <- od_data %>%
unique() %>%
drop_na() %>%
group_by(ORIGIN_HEX, DESTIN_HEX) %>%
summarise(SUM_TRIPS = sum(TRIPS))`summarise()` has grouped output by 'ORIGIN_HEX'. You can override using the
`.groups` argument.
show code
head(od_data)# A tibble: 6 × 3
# Groups: ORIGIN_HEX [1]
ORIGIN_HEX DESTIN_HEX SUM_TRIPS
<int> <int> <dbl>
1 2 6 3
2 2 13 34
3 2 20 182
4 2 37 1
5 2 50 18
6 2 53 1
- Now we write the output to RDS
show code
write_rds(od_data, "data/rds/od_data.rds")
od_data <- read_rds("data/rds/od_data.rds")- Remove intrazonal flows
show code
od_data1 <- od_data[od_data$ORIGIN_HEX!=od_data$DESTIN_HEX,]- Create desire lines
show code
flowLine <- od2line(flow = od_data1,
zones = hexagon_sf,
zone_code = "grid_id")Creating centroids representing desire line start and end points.
show code
summary(od_data1$SUM_TRIPS) Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 4.0 17.0 129.6 66.0 73720.0
- Quick exploration: most values are below 500
show code
lplot <- ggplot(data = flowLine, aes(x=SUM_TRIPS, y=ORIGIN_HEX)) +
geom_point() +
labs(title = "Histogram with Different Fill Colors",
x = "Value",
y = "Frequency") +
geom_vline(xintercept = 1000, colour="red")
rplot1 <- ggplot(data = flowLine,
aes(x=SUM_TRIPS)) +
geom_histogram(breaks=seq(0, 20000, by = 1000)) +
geom_vline(xintercept = 1000, colour="red")
rplot2 <- ggplot(data = flowLine,
aes(x=SUM_TRIPS)) +
geom_histogram(breaks=seq(0, 2000, by = 500)) +
geom_vline(xintercept = 1000, colour="red")
lplot / (rplot1 + rplot2)
custom_palette <- c('#fee090','#d73027','#72001a', '#313695')
tmap_mode("plot")tmap mode set to plotting
tm_shape(hexagon_sf) +
tm_polygons() +
flowLine %>%
filter(SUM_TRIPS >= 1000) %>%
tm_shape() +
tm_lines(col = "SUM_TRIPS",
lwd = "SUM_TRIPS",
palette = custom_palette,
style = "fixed",
breaks = c(1000, 2000, 4000, 8000, 100000),
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.8)
- Plotting trips over 10,000:
- 150 trips over 10,000
- 20 trips over 20,000
flowLine[flowLine$SUM_TRIPS>=20000,]Simple feature collection with 20 features and 3 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 12032.62 ymin: 32977.29 xmax: 41095.12 ymax: 47591.47
Projected CRS: SVY21 / Singapore TM
First 10 features:
ORIGIN_HEX DESTIN_HEX SUM_TRIPS geometry
4625 279 191 20266 LINESTRING (13907.62 35900....
4637 279 204 23717 LINESTRING (13907.62 35900....
4665 279 237 30565 LINESTRING (13907.62 35900....
4675 279 253 21468 LINESTRING (13907.62 35900....
4676 279 254 23279 LINESTRING (13907.62 35900....
6903 325 309 26335 LINESTRING (15220.12 36224....
12752 427 405 49216 LINESTRING (17845.12 40771....
16829 485 516 21658 LINESTRING (18595.12 42070....
24552 594 653 59391 LINESTRING (20095.12 45318....
26612 615 587 22427 LINESTRING (20470.12 32977....
custom_palette <- c('#fee090','#fee090','#d73027','#d73027','#72001a', '#313695')
tmap_mode("plot")tmap mode set to plotting
tm_shape(hexagon_sf) +
tm_polygons() +
flowLine %>%
filter(SUM_TRIPS >= 10000) %>%
tm_shape() +
tm_lines(col = "SUM_TRIPS",
lwd = "SUM_TRIPS",
palette = custom_palette,
style = "jenks",,
scale = c(1, 2, 4, 8, 10, 14),
n = 6,
alpha = 0.8)
show code
od_data1_origin <- od_data1 %>%
group_by(ORIGIN_HEX) %>%
summarise(SUM_ORIG_TRIPS = sum(SUM_TRIPS))
orig_hexagon_sf <- left_join(hexagon_sf, od_data1_origin,
by = c("grid_id" = "ORIGIN_HEX")) %>%
select(grid_id,
SUM_ORIG_TRIPS, raw_hex_grid) %>%
drop_na()
ggplot(orig_hexagon_sf, aes(x = grid_id , y = SUM_ORIG_TRIPS)) +
geom_boxplot(fill = "lightblue", color = "black", outlier.shape = 16, outlier.colour = "red") +
geom_point(position = position_jitter(width = 0.2), color = "red", size = .5) +
scale_y_log10() +
labs(title = "Boxplot with Outliers Highlighted",
x = "Category",
y = "Value")Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

- find top 10% of tirps by origin hexes origin
show code
origin_threshold <- quantile(orig_hexagon_sf$SUM_ORIG_TRIPS, 0.99) #237114
orig_trips_q90 <- quantile(orig_hexagon_sf$SUM_ORIG_TRIPS, 0.9) #237114
orig_trips_q99 <- quantile(orig_hexagon_sf$SUM_ORIG_TRIPS, 0.99) #237114
orig_trips_q999 <- quantile(orig_hexagon_sf$SUM_ORIG_TRIPS, 0.999) #237114
custom_origin_breaks <- c(0, orig_trips_q90, orig_trips_q99, orig_trips_q999, Inf)
# orig_hexagon_sf[orig_hexagon_sf$SUM_ORIG_TRIPS>500000,]
# custom_origin_breaks
# filter(orig_hexagon_sf, SUM_ORIG_TRIPS >= origin_threshold)
tmap_mode("plot")tmap mode set to plotting
show code
tm_shape(mpsz_sf) +
tm_polygons() +
tm_shape(orig_hexagon_sf) +
tm_polygons(col="SUM_ORIG_TRIPS", palette="YlGnBu", style = "fixed", breaks = custom_origin_breaks)Warning: The shape mpsz_sf is invalid. See sf::st_is_valid

show code
tmap_mode("plot")tmap mode set to plotting
show code
tm_shape(mpsz_sf) +
tm_polygons() +
tm_shape(filter(orig_hexagon_sf, SUM_ORIG_TRIPS >= 398408.9) ) +
tm_polygons(col="SUM_ORIG_TRIPS", palette="-plasma") +
flowLine %>%
filter(ORIGIN_HEX %in% c(279, 771, 2004)) %>%
tm_shape() +
tm_lines(col = "SUM_TRIPS",
lwd = "SUM_TRIPS",
palette = custom_palette,
style = "quantile",
scale = c(1, 2, 4, 8, 10, 14),
n = 6,
alpha = 0.8)Warning: The shape mpsz_sf is invalid. See sf::st_is_valid

show code
# tm_basemap(providers$OneMapSG.Grey) +
# tm_shape(hexagon_sf) +
# tm_fill(
# col = "n_bus_stops",
# palette = "-plasma",
# style = "cont",
#
# breaks = c(0, 1, 2- plot top 10 % of hexes by destination
show code
od_data1_destin <- od_data1 %>%
group_by(DESTIN_HEX) %>%
summarise(SUM_DEST_TRIPS = sum(SUM_TRIPS))
dest_hexagon_sf <- left_join(hexagon_sf, od_data1_destin,
by = c("grid_id" = "DESTIN_HEX")) %>%
select(grid_id,
SUM_DEST_TRIPS, raw_hex_grid) %>%
drop_na()
dest_trips_q90 <- quantile(dest_hexagon_sf$SUM_DEST_TRIPS, 0.9) #237114
dest_trips_q99 <- quantile(dest_hexagon_sf$SUM_DEST_TRIPS, 0.99) #237114
dest_trips_q999 <- quantile(dest_hexagon_sf$SUM_DEST_TRIPS, 0.999) #237114
custom_destin_breaks <- c(0, dest_trips_q90, dest_trips_q99, dest_trips_q999, Inf)
tmap_mode("plot")tmap mode set to plotting
show code
tm_shape(mpsz_sf) +
tm_polygons() +
tm_shape(dest_hexagon_sf) +
tm_polygons(col="SUM_DEST_TRIPS", palette="YlOrRd", style = "fixed", breaks = custom_origin_breaks)Warning: The shape mpsz_sf is invalid. See sf::st_is_valid

1.2 computing Distance matrix
- calculate & find top 10% of tirps by destin hexes
show code
hexagon_sf <- subset(hexagon_sf, grid_id != 1) # run without this line for 2172 hexagons
hexagon_sp <- as(hexagon_sf, "Spatial")
distance_matrix <- spDists(hexagon_sp,
longlat = FALSE)
head(distance_matrix, n=c(10, 10)) [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.0000 750.0000 2625.0000 3269.1742 649.5191 2341.8742 2087.9116
[2,] 750.0000 0.0000 1948.5572 2598.0762 375.0000 1634.5871 1352.0817
[3,] 2625.0000 1948.5572 0.0000 649.5191 2281.0359 375.0000 750.0000
[4,] 3269.1742 2598.0762 649.5191 0.0000 2928.8436 992.1567 1352.0817
[5,] 649.5191 375.0000 2281.0359 2928.8436 0.0000 1948.5572 1634.5871
[6,] 2341.8742 1634.5871 375.0000 992.1567 1948.5572 0.0000 375.0000
[7,] 2087.9116 1352.0817 750.0000 1352.0817 1634.5871 375.0000 0.0000
[8,] 3968.6270 3269.1742 1352.0817 750.0000 3577.2720 1634.5871 1948.5572
[9,] 1352.0817 649.5191 1718.4659 2341.8742 750.0000 1352.0817 992.1567
[10,] 1875.0000 1125.0000 1125.0000 1718.4659 1352.0817 750.0000 375.0000
[,8] [,9] [,10]
[1,] 3968.627 1352.0817 1875.0000
[2,] 3269.174 649.5191 1125.0000
[3,] 1352.082 1718.4659 1125.0000
[4,] 750.000 2341.8742 1718.4659
[5,] 3577.272 750.0000 1352.0817
[6,] 1634.587 1352.0817 750.0000
[7,] 1948.557 992.1567 375.0000
[8,] 0.000 2928.8436 2281.0359
[9,] 2928.844 0.0000 649.5191
[10,] 2281.036 649.5191 0.0000
show code
# dim(distance_matrix)
# hexagon_sf[hexagon_sf$grid_id == 2172,]- melt distance matrix into distance pair
show code
distPair <- melt(distance_matrix) %>%
rename(dist = value
, grid_id_from = Var1
, grid_id_to = Var2)
head(distPair, 10) grid_id_from grid_id_to dist
1 1 1 0.0000
2 2 1 750.0000
3 3 1 2625.0000
4 4 1 3269.1742
5 5 1 649.5191
6 6 1 2341.8742
7 7 1 2087.9116
8 8 1 3968.6270
9 9 1 1352.0817
10 10 1 1875.0000
show code
# 1919 2172
# subset(distPair, grid_id_from %in% c(1919, 2172) & grid_id_to %in% c(1919, 2172))
# distPair[distPair$grid_id_from == 1919 & distPair$grid_id_to == 2172, ]- melt distance matrix into distance pair
show code
distPair %>%
filter(dist > 0) %>%
summary() grid_id_from grid_id_to dist
Min. : 1 Min. : 1 Min. : 375
1st Qu.: 543 1st Qu.: 543 1st Qu.: 7875
Median :1086 Median :1086 Median :12898
Mean :1086 Mean :1086 Mean :13607
3rd Qu.:1629 3rd Qu.:1629 3rd Qu.:18221
Max. :2171 Max. :2171 Max. :44554
- Minimum distance 375
- We will arbirtraitly insert 50m into interzonal distance
distPair$dist <- ifelse(distPair$dist == 0,
50, distPair$dist)
distPair %>%
summary() grid_id_from grid_id_to dist
Min. : 1 Min. : 1 Min. : 50
1st Qu.: 543 1st Qu.: 543 1st Qu.: 7875
Median :1086 Median :1086 Median :12898
Mean :1086 Mean :1086 Mean :13600
3rd Qu.:1629 3rd Qu.:1629 3rd Qu.:18221
Max. :2171 Max. :2171 Max. :44554
write_rds(distPair, "data/rds/distPair.rds") 1.3 prepare flow data
od_data# A tibble: 186,031 × 3
# Groups: ORIGIN_HEX [2,128]
ORIGIN_HEX DESTIN_HEX SUM_TRIPS
<int> <int> <dbl>
1 2 6 3
2 2 13 34
3 2 20 182
4 2 37 1
5 2 50 18
6 2 53 1
7 2 54 6
8 2 57 145
9 3 8 2
10 3 13 4
# ℹ 186,021 more rows
od_data$od_no_intra <- ifelse(
od_data$ORIGIN_HEX == od_data$DESTIN_HEX,
0, od_data$SUM_TRIPS)
od_data$offset <- ifelse(
od_data$ORIGIN_HEX == od_data$DESTIN_HEX,
0.000001, 1)
# od_data$ORIGIN_HEX <- as.factor(od_data$ORIGIN_HEX)
# od_data$DESTIN_HEX <- as.factor(od_data$DESTIN_HEX)
#
od_data_distpair <- od_data %>%
left_join (distPair,
by = c("ORIGIN_HEX" = "grid_id_from",
"DESTIN_HEX" = "grid_id_to"))
od_data_distpair[od_data_distpair$od_no_intra == 0, ]# A tibble: 757 × 6
# Groups: ORIGIN_HEX [757]
ORIGIN_HEX DESTIN_HEX SUM_TRIPS od_no_intra offset dist
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 15 15 1 0 0.000001 50
2 35 35 2 0 0.000001 50
3 54 54 40 0 0.000001 50
4 57 57 3 0 0.000001 50
5 69 69 1 0 0.000001 50
6 72 72 3 0 0.000001 50
7 74 74 7 0 0.000001 50
8 83 83 1 0 0.000001 50
9 99 99 1 0 0.000001 50
10 127 127 2 0 0.000001 50
# ℹ 747 more rows
- however, we should filter this down to interzonal flow:
inter_zonal_flow <- od_data_distpair %>%
filter(od_no_intra > 0)- visualize dependent variable
ggplot(data = inter_zonal_flow,
aes(x = SUM_TRIPS)) +
geom_histogram()`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

- visualize dependent variable
ggplot(data = inter_zonal_flow,
aes(x = log(dist),
y = log(SUM_TRIPS))) +
geom_point() +
geom_smooth(method = lm)`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 22 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 22 rows containing missing values (`geom_point()`).

- check no zero values
summary(inter_zonal_flow) ORIGIN_HEX DESTIN_HEX SUM_TRIPS od_no_intra offset
Min. : 2 Min. : 2 Min. : 1.0 Min. : 1.0 Min. :1
1st Qu.: 848 1st Qu.: 828 1st Qu.: 4.0 1st Qu.: 4.0 1st Qu.:1
Median :1276 Median :1273 Median : 17.0 Median : 17.0 Median :1
Mean :1238 Mean :1237 Mean : 129.6 Mean : 129.6 Mean :1
3rd Qu.:1648 3rd Qu.:1657 3rd Qu.: 66.0 3rd Qu.: 66.0 3rd Qu.:1
Max. :2172 Max. :2172 Max. :73720.0 Max. :73720.0 Max. :1
dist
Min. : 375
1st Qu.: 2625
Median : 5073
Mean : 6112
3rd Qu.: 8738
Max. :25334
NA's :22
- unconstrained
uncSIM <- glm(formula = SUM_TRIPS ~
log(dist),
family = poisson(link = "log"),
data = od_data_distpair,
na.action = na.exclude)
uncSIM
Call: glm(formula = SUM_TRIPS ~ log(dist), family = poisson(link = "log"),
data = od_data_distpair, na.action = na.exclude)
Coefficients:
(Intercept) log(dist)
9.6149 -0.5881
Degrees of Freedom: 186008 Total (i.e. Null); 186007 Residual
(22 observations deleted due to missingness)
Null Deviance: 95130000
Residual Deviance: 84160000 AIC: 85050000
1.6.2 Identify Bus Trip Details For Each Hexagon grid_id
- Here, we again use multiple steps to generate bus trip details for each hexagon
grid_id;- We use
left_join()to add thegrid_idto each row ofodbus_filtered, since each row has a unique single bus stop ID (i.e.$BUS_STOP_N); - We use
select()to retain only thegrid_idand the four peak-trips columns; - We combine
group_by()andsummarise()to aggregate the trips for each peak bygrid_id.
- We use
- Finally, we use
head()to preview thegrid_trips_dftibble dataframe.
show code
colnames(odbus_filtered)[1] "ORIGIN_PT_CODE" "DESTINATION_PT_CODE" "TRIPS"
show code
grid_trips_df <- left_join(odbus_filtered, bus_stop_hexgrid_id,
by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
select(grid_id,
TRIPS) %>%
group_by(grid_id) %>%
summarise(
TRIPS = sum(TRIPS)
)Adding missing grouping variables: `ORIGIN_PT_CODE`
show code
head(grid_trips_df)# A tibble: 6 × 2
grid_id TRIPS
<int> <dbl>
1 2 390
2 3 114
3 4 291
4 5 241
5 6 1905
6 7 299
1.6.3 Combine Bus Trip Details Back Into hexagon_sf
- Finally, it’s time to recombine bus trip data back into
hexagon_sf;- We use
left_joinon$grid_idto add trip data back into hexagon_sf; - We add a failsafe
mutate()to replace any “NA” values for the columns.
- We use
show code
hexagon_sf <- left_join(hexagon_sf, grid_trips_df,
by = 'grid_id' ) %>%
mutate(
TRIPS = ifelse(is.na(TRIPS), 0, TRIPS)
)
head(hexagon_sf)Simple feature collection with 6 features and 3 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 27889.39 xmax: 4907.622 ymax: 31570
Projected CRS: SVY21 / Singapore TM
grid_id n_bus_stops TRIPS raw_hex_grid
1 2 1 390 POLYGON ((4157.622 27889.39...
2 3 1 114 POLYGON ((4532.622 28538.91...
3 4 1 291 POLYGON ((4532.622 30487.47...
4 5 1 241 POLYGON ((4532.622 31136.99...
5 6 1 1905 POLYGON ((4720.122 28214.15...
6 7 1 299 POLYGON ((4720.122 30162.71...
1.7 Exploratory Data Analysis Of Bus Trips, Across Peak Periods, By Hexagons
- For
ggplot, we need data in long format, so we can usegather()ongrid_trips_dffrom Section 1.6.2 to pivot this; - We then pipe this into a
geom_boxplot()for an exploratory look:
show code
custom_breaks <- c(seq(0, 50000, by = 5000), 565000)
lplot <- ggplot(data = hexagon_sf, aes(x=TRIPS, y=grid_id)) +
geom_point() +
labs(title = "Histogram with Different Fill Colors",
x = "Value",
y = "Frequency") +
geom_vline(xintercept = 20000, colour="red")
rplot <- ggplot(data = hexagon_sf[hexagon_sf$TRIPS <50001, ]
, aes(x=TRIPS)) +
geom_histogram() + #breaks=seq(0, 50000, by = 5000)
geom_vline(xintercept = 20000, colour="red")
lplot / rplot`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

show code
# str(hexagon_sf$TRIPS)
# gather(grid_trips_df, key = "Peak", value = "Trips", -grid_id) %>%
# ggplot( aes(x=grid_id, y='TRIPS', fill=grid_id)) +
# geom_boxplot() +
# ggtitle("Boxplot: Trips over peak periods, 2023-Oct data") +
# xlab("") +
# theme(
# legend.position="none"
# ) +
# coord_flip()We also observe that number of trips for Weekday Morning & Weekday Afternoon seems to be larger than Weekend Morning and Weekend Evening peak trips. This is also confirmed by the figure in the next section.
This means that we will have to consider Weekday and Weekend peaks on different scales.
- This is an exceptionally ugly plot, but it shows an important point: there is some serious right skew in our dataset;
- Clearly, there are some hexagons with exceptionally high trips compared to the rest of the hexagons But, could this be because some hexagons have up to 11 bus stops, while others have 1 or 2?
- We do a quick scatter plot based on
$n_bus_stopsto verify this:
show code
# hexagon_sf %>%
# st_drop_geometry() %>%
# pivot_longer(cols = starts_with("WEEK"),
# names_to = "PEAK", values_to = "TRIPS") %>%
# ggplot( aes(x=TRIPS, y=n_bus_stops, color=PEAK, shape=PEAK)) +
# geom_point(size=2) +
# ggtitle("Scatterplot: Trips over peak periods by number of bus stops per hexagon, 2023-Oct data") +
# theme(legend.position=c(.85, .15),
# legend.background = element_rect(fill = "transparent"),
# legend.key.size = unit(0.5, "cm"),
# legend.text = element_text(size = 6),
# legend.title = element_text(size = 8)
# ) - Surprising results from our plot! If we consider those with > 100,000 trips as outliers, most of them come from hexagons with between 4-8 bus stops;
- There is some correlation between number of bus stops and high numbers of trips, but a stronger factor is peak time; Weekday Morning peak trips, followed by Weekday Afternoon peak trips, contribute to the largest outliers.
I note that these visualizations only scrape the surface of understanding the data. However, this is not the focus of our study; we do these quick visualizations only to provide better context for our study.
2. Preparing three propulsive and three attractiveness variables
2.1 Preparing Business:
- read in business values
- check st_crs: correct
show code
business_sf <- st_read(dsn = "data/geospatial",
layer = "Business")Reading layer `Business' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 6550 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3669.148 ymin: 25408.41 xmax: 47034.83 ymax: 50148.54
Projected CRS: SVY21 / Singapore TM
show code
st_crs(business_sf)Coordinate Reference System:
User input: SVY21 / Singapore TM
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
- quick visualization
tmap_mode("plot")tmap mode set to plotting
# tmap_options(check.and.fix = TRUE)
tm_shape(hexagon_sf) +
tm_polygons() +
tm_shape(business_sf) +
tm_dots()
- point-in-plot count, test for NA
show code
hexagon_sf$`BUSINESS_COUNT`<- lengths(
st_intersects(
hexagon_sf, business_sf))
summary(hexagon_sf$BUSINESS_COUNT) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 0.00 2.08 2.00 44.00
show code
# hexagon_sf[hexagon_sf$BUSINESS_COUNT > 30,]
# hexagon_sf[is.na(hexagon_sf$BUSINESS_COUNT),]2.2 Repeat steps for various other factors:
show code
entertn_sf <- st_read(dsn = "data/geospatial",
layer = "entertn")Reading layer `entertn' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 114 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 10809.34 ymin: 26528.63 xmax: 41600.62 ymax: 46375.77
Projected CRS: SVY21 / Singapore TM
show code
# st_crs(entertn_sf)
hexagon_sf$`ENTERTAIN_COUNT`<- lengths(
st_intersects(
hexagon_sf, entertn_sf))
summary(hexagon_sf$ENTERTAIN_COUNT) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.0456 0.0000 7.0000
show code
fnb_sf <- st_read(dsn = "data/geospatial",
layer = "F&B")Reading layer `F&B' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 1919 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6010.495 ymin: 25343.27 xmax: 45462.43 ymax: 48796.21
Projected CRS: SVY21 / Singapore TM
show code
# st_crs(fnb_sf)
hexagon_sf$`FNB_COUNT`<- lengths(
st_intersects(
hexagon_sf, fnb_sf))
summary(hexagon_sf$FNB_COUNT) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 0.807 0.000 81.000
show code
finserv_sf <- st_read(dsn = "data/geospatial",
layer = "FinServ")Reading layer `FinServ' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 3320 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 4881.527 ymin: 25171.88 xmax: 46526.16 ymax: 49338.02
Projected CRS: SVY21 / Singapore TM
show code
# st_crs(finserv_sf)
hexagon_sf$`FINSERV_COUNT`<- lengths(
st_intersects(
hexagon_sf, finserv_sf))
summary(hexagon_sf$FINSERV_COUNT) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 1.423 1.000 76.000
show code
leisure_sf <- st_read(dsn = "data/geospatial",
layer = "Liesure&Recreation")Reading layer `Liesure&Recreation' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 1217 features and 30 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6010.495 ymin: 25134.28 xmax: 48439.77 ymax: 50078.88
Projected CRS: SVY21 / Singapore TM
show code
# st_crs(leisure_sf)
hexagon_sf$`LEISURE_COUNT`<- lengths(
st_intersects(
hexagon_sf, leisure_sf))
summary(hexagon_sf$LEISURE_COUNT) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.4017 0.0000 23.0000
show code
retail_sf <- st_read(dsn = "data/geospatial",
layer = "Retails")Reading layer `Retails' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 37635 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 4737.982 ymin: 25171.88 xmax: 48265.04 ymax: 50135.28
Projected CRS: SVY21 / Singapore TM
show code
# st_crs(retail_sf)
hexagon_sf$`RETAIL_COUNT`<- lengths(
st_intersects(
hexagon_sf, retail_sf))
summary(hexagon_sf$RETAIL_COUNT) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 2.00 15.46 9.00 987.00
show code
mrt_exit_sf <- st_read(dsn = "data/geospatial",
layer = "Train_Station_Exit_Layer") %>%
st_transform(crs=3414)Reading layer `Train_Station_Exit_Layer' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 565 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21
show code
st_crs(mrt_exit_sf)Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
show code
hexagon_sf$`MRT_EXIT_COUNT`<- lengths(
st_intersects(
hexagon_sf, mrt_exit_sf))
summary(hexagon_sf$MRT_EXIT_COUNT) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.2446 0.0000 9.0000
- First, let’s write hexagon_sf to rds
write_rds(hexagon_sf, "data/rds/hexagon_features.rds")
head(hexagon_sf)Simple feature collection with 6 features and 10 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 27889.39 xmax: 4907.622 ymax: 31570
Projected CRS: SVY21 / Singapore TM
grid_id n_bus_stops TRIPS raw_hex_grid BUSINESS_COUNT
1 2 1 390 POLYGON ((4157.622 27889.39... 0
2 3 1 114 POLYGON ((4532.622 28538.91... 1
3 4 1 291 POLYGON ((4532.622 30487.47... 7
4 5 1 241 POLYGON ((4532.622 31136.99... 1
5 6 1 1905 POLYGON ((4720.122 28214.15... 1
6 7 1 299 POLYGON ((4720.122 30162.71... 30
ENTERTAIN_COUNT FNB_COUNT FINSERV_COUNT LEISURE_COUNT RETAIL_COUNT
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 1
MRT_EXIT_COUNT
1 0
2 0
3 0
4 0
5 0
6 0
- Now we perform plot of layers
tmap_mode("view")tmap mode set to interactive viewing
tm <- tm_shape(hexagon_sf, group="FINSERV_COUNT") +
tm_borders(group="FINSERV_COUNT") +
tm_fill(col = "FINSERV_COUNT",
palette = "Greens",
style = "jenks",
group="FINSERV_COUNT") +
tm_shape(hexagon_sf, group="FNB_COUNT") +
tm_borders(group="FNB_COUNT") +
tm_fill(col = "FNB_COUNT",
palette = "Reds",
style = "jenks",
group="FNB_COUNT") +
tm_shape(hexagon_sf, group="ENTERTAIN_COUNT") +
tm_borders(group="ENTERTAIN_COUNT") +
tm_fill(col = "ENTERTAIN_COUNT",
palette = "Blues",
style = "jenks",
group="ENTERTAIN_COUNT") +
tm_shape(hexagon_sf, group="LEISURE_COUNT") +
tm_borders(group="LEISURE_COUNT") +
tm_fill(col = "LEISURE_COUNT",
palette = "Purples",
style = "jenks",
group="LEISURE_COUNT")
tm %>%
tmap_leaflet() %>%
hideGroup(c("FNB_COUNT", "ENTERTAIN_COUNT", "LEISURE_COUNT")) %>%
addLayersControl(
overlayGroups = c("FINSERV_COUNT", "FNB_COUNT",
"ENTERTAIN_COUNT", "LEISURE_COUNT"),
options = layersControlOptions(collapsed = FALSE),
position = "topleft"
)# tmap_mode("plot")
# # tmap_options(check.and.fix = TRUE)
# tm_shape(hexagon_sf) +
# tm_polygons() +
# tm_shape(entertn_sf) +
# tm_dots()
#
# tmap_mode("view")
# tm_shape(hexagon_sf[hexagon_sf$FINSERV_COUNT > 70, ]) +
# tm_polygons() +
# tm_shape(hexagon_sf[hexagon_sf$FNB_COUNT > 70, ]) +
# tm_polygons() +
# tm_shape(hexagon_sf[hexagon_sf$ENTERTAIN_COUNT > 6, ]) +
# tm_polygons() +
# tm_shape(hexagon_sf[hexagon_sf$LEISURE_COUNT > 20, ]) +
# tm_polygons() +
# tm_shape(hexagon_sf[hexagon_sf$RETAIL_COUNT > 100, ]) +
# tm_polygons()show code
# entertn_sf <- st_read(dsn = "data/geospatial",
# layer = "entertn")
# st_crs(entertn_sf)
#
# fnb_sf <- st_read(dsn = "data/geospatial",
# layer = "F&B")
# st_crs(fnb_sf)
#
# finserv_sf <- st_read(dsn = "data/geospatial",
# layer = "FinServ")
# st_crs(finserv_sf)
leisure_sf <- st_read(dsn = "data/geospatial",
layer = "Liesure&Recreation")Reading layer `Liesure&Recreation' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 1217 features and 30 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6010.495 ymin: 25134.28 xmax: 48439.77 ymax: 50078.88
Projected CRS: SVY21 / Singapore TM
show code
st_crs(leisure_sf)Coordinate Reference System:
User input: SVY21 / Singapore TM
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
show code
leisure_sf <- st_read(dsn = "data/geospatial",
layer = "Liesure&Recreation")Reading layer `Liesure&Recreation' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 1217 features and 30 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6010.495 ymin: 25134.28 xmax: 48439.77 ymax: 50078.88
Projected CRS: SVY21 / Singapore TM
show code
st_crs(leisure_sf)Coordinate Reference System:
User input: SVY21 / Singapore TM
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
- read & Fix mrt
- use st_make_valid() does not work, so I sourced wdpar::st_repair_geometry
wdpar source, based onpreprpackage to repair geometries- thereafter we need to coerce from EPSG9001 to svy21
show code
library(wdpar)
raw_mrt_sf <- st_read(dsn = "data/geospatial",
layer = "RapidTransitSystemStation")Reading layer `RapidTransitSystemStation' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, : GDAL
Message 1: Non closed ring detected. To avoid accepting it, set the
OGR_GEOMETRY_ACCEPT_UNCLOSED_RING configuration option to NO
Simple feature collection with 220 features and 4 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
show code
mrt_sf <- st_repair_geometry(raw_mrt_sf) %>%
st_transform(crs=3414)
st_crs(mrt_sf)Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
show code
tmap_mode("plot")tmap mode set to plotting
show code
# tmap_options(check.and.fix = TRUE)
tm_shape(hexagon_sf) +
tm_polygons(col = "MRT_EXIT_COUNT", palette = "-Spectral") +
tm_shape(mrt_sf) +
tm_dots(col='red')
- Largely seem to overlap;
- Based on reasoning, MRT Exits seem like a more likely source of entry points, we will use that
3. Spatial Interactive Models
3.1 Create SIM_DATA
Combine Flow data with hexagon sf
Propulsive:
- business_count
- MRT_EXIT_COUNT
- finserv count
Attractive:
- leisure_count
- fnb_count
- entertainment_count
3.1.P Propulsive count
show code
prop_sf <- hexagon_sf %>%
select(grid_id, BUSINESS_COUNT, FNB_COUNT, FINSERV_COUNT, MRT_EXIT_COUNT) %>%
st_drop_geometry()
inter_zonal_flow_prop <- inter_zonal_flow %>%
left_join(prop_sf,
by = c("ORIGIN_HEX" = "grid_id")) %>%
rename(P_MRT_EXIT_COUNT = MRT_EXIT_COUNT,
P_FNB_COUNT = FNB_COUNT,
P_BUSINESS_COUNT = BUSINESS_COUNT)
inter_zonal_flow_prop# A tibble: 185,274 × 10
# Groups: ORIGIN_HEX [2,128]
ORIGIN_HEX DESTIN_HEX SUM_TRIPS od_no_intra offset dist P_BUSINESS_COUNT
<int> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 2 6 3 3 1 1635. 0
2 2 13 34 34 1 3616. 0
3 2 20 182 182 1 1875. 0
4 2 37 1 1 1 1635. 0
5 2 50 18 18 1 5773. 0
6 2 53 1 1 1 7658. 0
7 2 54 6 6 1 2704. 0
8 2 57 145 145 1 8017. 0
9 3 8 2 2 1 1352. 1
10 3 13 4 4 1 1718. 1
# ℹ 185,264 more rows
# ℹ 3 more variables: P_FNB_COUNT <int>, FINSERV_COUNT <int>,
# P_MRT_EXIT_COUNT <int>
3.1.P Attractiveness count
show code
attr_sf <- hexagon_sf %>%
select(grid_id, ENTERTAIN_COUNT , FNB_COUNT, LEISURE_COUNT, RETAIL_COUNT, MRT_EXIT_COUNT) %>%
st_drop_geometry()
SIM_data <- inter_zonal_flow_prop %>%
left_join(attr_sf,
by = c("DESTIN_HEX" = "grid_id")) %>%
rename(A_MRT_EXIT_COUNT = MRT_EXIT_COUNT,
A_FNB_COUNT = FNB_COUNT,
A_FINSERV_COUNT = FINSERV_COUNT,
A_ENTERTAIN_COUNT = ENTERTAIN_COUNT,
A_LEISURE_COUNT = LEISURE_COUNT,
A_RETAIL_COUNT = RETAIL_COUNT)
SIM_data# A tibble: 185,274 × 15
# Groups: ORIGIN_HEX [2,128]
ORIGIN_HEX DESTIN_HEX SUM_TRIPS od_no_intra offset dist P_BUSINESS_COUNT
<int> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 2 6 3 3 1 1635. 0
2 2 13 34 34 1 3616. 0
3 2 20 182 182 1 1875. 0
4 2 37 1 1 1 1635. 0
5 2 50 18 18 1 5773. 0
6 2 53 1 1 1 7658. 0
7 2 54 6 6 1 2704. 0
8 2 57 145 145 1 8017. 0
9 3 8 2 2 1 1352. 1
10 3 13 4 4 1 1718. 1
# ℹ 185,264 more rows
# ℹ 8 more variables: P_FNB_COUNT <int>, A_FINSERV_COUNT <int>,
# P_MRT_EXIT_COUNT <int>, A_ENTERTAIN_COUNT <int>, A_FNB_COUNT <int>,
# A_LEISURE_COUNT <int>, A_RETAIL_COUNT <int>, A_MRT_EXIT_COUNT <int>
- Check summary
show code
summary(SIM_data) ORIGIN_HEX DESTIN_HEX SUM_TRIPS od_no_intra offset
Min. : 2 Min. : 2 Min. : 1.0 Min. : 1.0 Min. :1
1st Qu.: 848 1st Qu.: 828 1st Qu.: 4.0 1st Qu.: 4.0 1st Qu.:1
Median :1276 Median :1273 Median : 17.0 Median : 17.0 Median :1
Mean :1238 Mean :1237 Mean : 129.6 Mean : 129.6 Mean :1
3rd Qu.:1648 3rd Qu.:1657 3rd Qu.: 66.0 3rd Qu.: 66.0 3rd Qu.:1
Max. :2172 Max. :2172 Max. :73720.0 Max. :73720.0 Max. :1
dist P_BUSINESS_COUNT P_FNB_COUNT A_FINSERV_COUNT
Min. : 375 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 2625 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median : 5073 Median : 0.000 Median : 0.000 Median : 0.000
Mean : 6112 Mean : 1.932 Mean : 1.902 Mean : 3.159
3rd Qu.: 8738 3rd Qu.: 2.000 3rd Qu.: 0.000 3rd Qu.: 3.000
Max. :25334 Max. :44.000 Max. :81.000 Max. :76.000
NA's :22
P_MRT_EXIT_COUNT A_ENTERTAIN_COUNT A_FNB_COUNT A_LEISURE_COUNT
Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. : 0.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 0.000
Median :0.0000 Median :0.0000 Median : 0.000 Median : 0.000
Mean :0.5174 Mean :0.1011 Mean : 1.772 Mean : 0.659
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.: 0.000 3rd Qu.: 1.000
Max. :9.0000 Max. :7.0000 Max. :81.000 Max. :23.000
A_RETAIL_COUNT A_MRT_EXIT_COUNT
Min. : 0.00 Min. :0.0000
1st Qu.: 1.00 1st Qu.:0.0000
Median : 5.00 Median :0.0000
Mean : 31.94 Mean :0.5025
3rd Qu.: 22.00 3rd Qu.:0.0000
Max. :987.00 Max. :9.0000
- We will coerce all 0 values to 0.99
show code
replace_zeroes_columns <- c("P_BUSINESS_COUNT", "P_FNB_COUNT", "A_FINSERV_COUNT", "P_MRT_EXIT_COUNT", "A_ENTERTAIN_COUNT", "A_FNB_COUNT", "A_LEISURE_COUNT", "A_RETAIL_COUNT", "A_MRT_EXIT_COUNT")
# Assuming 'your_data' is your data frame and 'columns_to_update' is a vector of column names
# Apply a function to replace 0 with 1 in specified columns
SIM_data[, replace_zeroes_columns] <- apply(SIM_data[, replace_zeroes_columns], 2, function(x) replace(x, x == 0, 0.99))
summary(SIM_data) ORIGIN_HEX DESTIN_HEX SUM_TRIPS od_no_intra offset
Min. : 2 Min. : 2 Min. : 1.0 Min. : 1.0 Min. :1
1st Qu.: 848 1st Qu.: 828 1st Qu.: 4.0 1st Qu.: 4.0 1st Qu.:1
Median :1276 Median :1273 Median : 17.0 Median : 17.0 Median :1
Mean :1238 Mean :1237 Mean : 129.6 Mean : 129.6 Mean :1
3rd Qu.:1648 3rd Qu.:1657 3rd Qu.: 66.0 3rd Qu.: 66.0 3rd Qu.:1
Max. :2172 Max. :2172 Max. :73720.0 Max. :73720.0 Max. :1
dist P_BUSINESS_COUNT P_FNB_COUNT A_FINSERV_COUNT
Min. : 375 Min. : 0.990 Min. : 0.990 Min. : 0.990
1st Qu.: 2625 1st Qu.: 0.990 1st Qu.: 0.990 1st Qu.: 0.990
Median : 5073 Median : 0.990 Median : 0.990 Median : 0.990
Mean : 6112 Mean : 2.568 Mean : 2.649 Mean : 3.661
3rd Qu.: 8738 3rd Qu.: 2.000 3rd Qu.: 0.990 3rd Qu.: 3.000
Max. :25334 Max. :44.000 Max. :81.000 Max. :76.000
NA's :22
P_MRT_EXIT_COUNT A_ENTERTAIN_COUNT A_FNB_COUNT A_LEISURE_COUNT
Min. :0.990 Min. :0.99 Min. : 0.990 Min. : 0.990
1st Qu.:0.990 1st Qu.:0.99 1st Qu.: 0.990 1st Qu.: 0.990
Median :0.990 Median :0.99 Median : 0.990 Median : 0.990
Mean :1.313 Mean :1.03 Mean : 2.524 Mean : 1.371
3rd Qu.:0.990 3rd Qu.:0.99 3rd Qu.: 0.990 3rd Qu.: 1.000
Max. :9.000 Max. :7.00 Max. :81.000 Max. :23.000
A_RETAIL_COUNT A_MRT_EXIT_COUNT
Min. : 0.99 Min. :0.99
1st Qu.: 1.00 1st Qu.:0.99
Median : 5.00 Median :0.99
Mean : 32.13 Mean :1.30
3rd Qu.: 22.00 3rd Qu.:0.99
Max. :987.00 Max. :9.00
- Write to RDS
show code
write_rds(SIM_data,
"data/rds/SIM_data_tidy.rds")3.1 Create SIM_DATA
- Combine Flow data with hexagon sf
show code
# important! change the grid_id to factor/categorical or character field
# if numeric, treated as continuous variable
uncSIM <- glm(formula = SUM_TRIPS ~
log(P_BUSINESS_COUNT) +
log(P_FNB_COUNT) +
log(A_FINSERV_COUNT) +
log(P_MRT_EXIT_COUNT) +
log(A_ENTERTAIN_COUNT) +
log(A_FNB_COUNT) +
log(A_LEISURE_COUNT) +
log(A_FNB_COUNT) +
log(A_RETAIL_COUNT) +
log(A_MRT_EXIT_COUNT) +
log(dist),
family = poisson(link = "log"),
data = SIM_data,
na.action = na.exclude)
uncSIM
Call: glm(formula = SUM_TRIPS ~ log(P_BUSINESS_COUNT) + log(P_FNB_COUNT) +
log(A_FINSERV_COUNT) + log(P_MRT_EXIT_COUNT) + log(A_ENTERTAIN_COUNT) +
log(A_FNB_COUNT) + log(A_LEISURE_COUNT) + log(A_FNB_COUNT) +
log(A_RETAIL_COUNT) + log(A_MRT_EXIT_COUNT) + log(dist),
family = poisson(link = "log"), data = SIM_data, na.action = na.exclude)
Coefficients:
(Intercept) log(P_BUSINESS_COUNT) log(P_FNB_COUNT)
10.84672 -0.08899 -0.31041
log(A_FINSERV_COUNT) log(P_MRT_EXIT_COUNT) log(A_ENTERTAIN_COUNT)
0.41158 0.51325 -0.47519
log(A_FNB_COUNT) log(A_LEISURE_COUNT) log(A_RETAIL_COUNT)
-0.19659 -0.23866 0.10072
log(A_MRT_EXIT_COUNT) log(dist)
0.60585 -0.80863
Degrees of Freedom: 185251 Total (i.e. Null); 185241 Residual
(22 observations deleted due to missingness)
Null Deviance: 94860000
Residual Deviance: 70910000 AIC: 71790000
show code
SIM_data[which(! complete.cases(SIM_data)), ]# A tibble: 22 × 15
# Groups: ORIGIN_HEX [10]
ORIGIN_HEX DESTIN_HEX SUM_TRIPS od_no_intra offset dist P_BUSINESS_COUNT
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1919 2172 45 45 1 NA 0.99
2 1948 2172 2 2 1 NA 0.99
3 1985 2172 5 5 1 NA 0.99
4 2010 2172 204 204 1 NA 0.99
5 2065 2172 8 8 1 NA 0.99
6 2088 2172 1 1 1 NA 0.99
7 2153 2172 2 2 1 NA 0.99
8 2164 2172 19 19 1 NA 0.99
9 2170 2172 2 2 1 NA 8
10 2172 1919 227 227 1 NA 0.99
# ℹ 12 more rows
# ℹ 8 more variables: P_FNB_COUNT <dbl>, A_FINSERV_COUNT <dbl>,
# P_MRT_EXIT_COUNT <dbl>, A_ENTERTAIN_COUNT <dbl>, A_FNB_COUNT <dbl>,
# A_LEISURE_COUNT <dbl>, A_RETAIL_COUNT <dbl>, A_MRT_EXIT_COUNT <dbl>
3.1 Create SIM_DATA
- Combine Flow data with hexagon sf
show code
# CalcRSquared <- function(observed,estimated){
# r <- cor(observed,estimated)
# R2 <- r^2
# R2
# }
# CalcRSquared(uncSIM$data$SUM_TRIPS, uncSIM$fitted.values)
# 185274
# 185252 - 223.1 Create SIM_DATA
- Combine Flow data with hexagon sf
3.1 Create SIM_DATA
- Combine Flow data with hexagon sf
3.1 Create SIM_DATA
- Combine Flow data with hexagon sf